Modeling with dependent failures
Abstract
My broad research interest is in dependable systems, in particular developing fault-tolerant distributed algorithms and applying them to practical problems. Developing dependable systems is an important goal as we increasingly rely upon large-scale wide-area distributed systems to support a wide range of online services. As systems scale in size and extent, efficiently coping with failures is an essential requirement for providing highly available services. Tolerating failures efficiently depends upon the type and frequency of failures in the system, and the failure model assumed to account for them.
A typical assumption when designing dependable systems is that failures are independent. This assumption, however, is increasingly unrealistic for emerging distributed systems because individual events can cause extensive dependent failures. For example, faults in shared infrastructure, such as power and cooling in machine rooms and hosting centers, can incapacitate multiple racks of machines at a time. Faults in shared resources, such as physical network connections and exchanges, can introduce severe network partitions. Large-scale Internet attacks, such as viruses and worms, can lead to widespread exploitation and failure of hundreds of thousands of hosts.
Given the prevalence and severity of dependent failures, developing methods for designing systems that explicitly model dependent failures has the potential to improve their efficiency and reliability. By explicitly modeling dependent failures, distributed systems can use fewer replicas or reduce their communication latency and overhead. For instance, with the same number of replicas they can achieve higher availability than replica selection that assumes independent failures.
When designing algorithms, a common conservative assumption is that it is possible to determine a threshold t on the number of process failures, where processes abstract units of computation in the system. This assumption is conservative because distinct subsets of processes of the same size might have different failure probabilities, and the value of the threshold must be the size of the largest subset that achieves a desired level of availability. Using this “threshold model” is adequate when failures are independent and identically distributed: all subsets of size t have the same failure probability. However, if this assumption does not hold and we still use a threshold, then we might be missing the opportunity of using fewer replicas. This observation leads to the following questions:
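As an illustration of the threshold discussion above (it is not one of the questions the abstract goes on to pose), the following minimal Python sketch shows how two subsets of the same size can have different failure probabilities once failures are dependent. Everything in it is an assumption made for illustration: the two-rack layout, the names `rack_of`, `RACK_FAIL` and `PROC_FAIL`, and the probability values do not come from the paper.

```python
from itertools import product

# Hypothetical layout: four processes spread over two racks.
rack_of = {"p0": "A", "p1": "A", "p2": "B", "p3": "B"}

RACK_FAIL = 0.05  # assumed probability that a whole rack goes down
PROC_FAIL = 0.01  # assumed probability that one process fails on its own


def joint_failure_probability(subset, rack_of, rack_fail, proc_fail):
    """Probability that every process in `subset` is down at the same time.

    A process is down if its rack is down, or if it fails on its own.
    Rack faults and individual faults are assumed mutually independent.
    """
    racks = sorted({rack_of[p] for p in subset})
    total = 0.0
    # Enumerate every up/down combination of the racks involved.
    for states in product([True, False], repeat=len(racks)):
        down = dict(zip(racks, states))
        prob = 1.0
        for is_down in down.values():
            prob *= rack_fail if is_down else 1.0 - rack_fail
        # A process on a surviving rack must fail individually.
        for p in subset:
            if not down[rack_of[p]]:
                prob *= proc_fail
        total += prob
    return total


same_rack = joint_failure_probability(("p0", "p1"), rack_of, RACK_FAIL, PROC_FAIL)
cross_rack = joint_failure_probability(("p0", "p2"), rack_of, RACK_FAIL, PROC_FAIL)

print(f"both replicas on one rack fail together: {same_rack:.6f}")
print(f"replicas on different racks fail together: {cross_rack:.6f}")
```

Under these assumed numbers the same-rack pair fails together with probability about 0.0501, while the cross-rack pair does so with probability about 0.0035. A single threshold t would treat both pairs alike, which is the sense in which the threshold model can cost replicas when failures are not independent and identically distributed.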
Similar resources
MTBF evaluation for 2-out-of-3 redundant repairable systems with common cause and cascade failures considering fuzzy rates for failures and repair: a case study of a centrifugal water pumping system
In many cases, redundant systems are beset by both independent and dependent failures. Ignoring dependent variables in MTBF evaluation of redundant systems hastens the occurrence of failure, causing it to take place before the expected time, hence decreasing safety and creating irreversible damages. Common cause failure (CCF) and cascading failure are two varieties of dependent failures, both l...
Extracting Manufacturing Process Map in the Form of the IDEF Model Prerequisite for the Implementation of PFMEA in the Sugar Industry
The purpose of this research is to demonstrate the importance of extracting business process mappings as a prerequisite for the implementation of the PFMEA (Process Failure Mode and Effect Analysis). In the first stage, 30 production process failures were extracted in the meetings with factory managers. Then, a team was formed by the presence of process owners, and with the help of the project ...
Physical and theoretical modeling of rock slopes against block-flexure toppling failure
Block-flexure is the most common mode of toppling failure in natural and excavated rock slopes. In such failure, some rock blocks break due to tensile stresses and some overturn under their own weights and then all of them topple together. In this paper, first, a brief review of previous studies on toppling failures is presented. Then, the physical and mechanical properties of experimental mode...
Calculation and Analysis of Reliability with Consideration of Common Cause Failures (CCF) (Case Study: The Input of the Dynamic Positioning System of a Submarine)
The reliability and safety of any system are its most important qualitative characteristics. These characteristics are of particular importance in systems whose functions are under various stresses, such as high temperature, high speed, high pressure, etc. A considerable point, which is rarely taken into account when calculating the reliability and safety of syst...
Transient Analysis of M/M/R Machining System with Mixed Standbys, Switching Failures, Balking, Reneging and Additional Removable Repairmen
The objective of this paper is to study the M/M/R machine repair queueing system with mixed standbys. The lifetime and repair time of units are assumed to be exponentially distributed. Failed units are repaired on an FCFS basis. The standbys have switching failure probability q (0≤q≤1). The repair facility of the system consists of R permanent as well as r additional removable repairmen. Due to i...
Publication date: 2006